from pyspark.sql.functions import col
# File location and type
file_location = "/FileStore/tables/telco.csv"
file_type = "csv"
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)
display(df)
#Show on a pie chart the proportions of possible types of 'Contract'
Table
Visualization 1
To pick up a draggable item, press the space bar.
While dragging, use the arrow keys to move the item.
Press space again to drop the item in its new position, or press escape to cancel.
customerID
gender
SeniorCitizen
Partner
Dependents
tenure
PhoneService
MultipleLines
InternetService
OnlineSecurity
OnlineBackup
DeviceProtection
TechSupport
StreamingTV
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
7590-VHVEG
Female
0
Yes
No
1
No
No phone service
DSL
No
Yes
No
No
No
5575-GNVDE
Male
0
No
No
34
Yes
No
DSL
Yes
No
Yes
No
No
3668-QPYBK
Male
0
No
No
2
Yes
No
DSL
Yes
Yes
No
No
No
7795-CFOCW
Male
0
No
No
45
No
No phone service
DSL
Yes
No
Yes
Yes
No
9237-HQITU
Female
0
No
No
2
Yes
No
Fiber optic
No
No
No
No
No
9305-CDSKC
Female
0
No
No
8
Yes
Yes
Fiber optic
No
No
Yes
No
Yes
1452-KIOVK
Male
0
No
Yes
22
Yes
Yes
Fiber optic
No
Yes
No
No
Yes
6713-OKOMC
Female
0
No
No
10
No
No phone service
DSL
Yes
No
No
No
No
7892-POOKP
Female
0
Yes
No
28
Yes
Yes
Fiber optic
No
No
Yes
Yes
Yes
6388-TABGU
Male
0
No
Yes
62
Yes
No
DSL
Yes
Yes
No
No
No
9763-GRSKD
Male
0
Yes
Yes
13
Yes
No
DSL
Yes
No
No
No
No
7469-LKBCI
Male
0
No
No
16
Yes
No
No
No internet service
No internet service
No internet service
No internet service
No internet service
8091-TTVAX
Male
0
Yes
No
58
Yes
Yes
Fiber optic
No
No
Yes
No
Yes
0280-XJGEX
Male
0
No
No
49
Yes
Yes
Fiber optic
No
Yes
Yes
No
Yes
5129-JLPIS
Male
0
No
No
25
Yes
No
Fiber optic
Yes
No
Yes
Yes
Yes
3655-SNQYZ
Female
0
Yes
Yes
69
Yes
Yes
Fiber optic
Yes
Yes
Yes
Yes
Yes
8191-XWSZG
Female
0
No
No
52
Yes
No
No
No internet service
No internet service
No internet service
No internet service
No internet service
7,043 rows
Command took 3.08 seconds
display(df)
Table
To pick up a draggable item, press the space bar.
While dragging, use the arrow keys to move the item.
Press space again to drop the item in its new position, or press escape to cancel.
customerID
gender
SeniorCitizen
Partner
Dependents
tenure
PhoneService
MultipleLines
InternetService
OnlineSecurity
OnlineBackup
DeviceProtection
TechSupport
StreamingTV
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
7590-VHVEG
Female
0
Yes
No
1
No
No phone service
DSL
No
Yes
No
No
No
5575-GNVDE
Male
0
No
No
34
Yes
No
DSL
Yes
No
Yes
No
No
3668-QPYBK
Male
0
No
No
2
Yes
No
DSL
Yes
Yes
No
No
No
7795-CFOCW
Male
0
No
No
45
No
No phone service
DSL
Yes
No
Yes
Yes
No
9237-HQITU
Female
0
No
No
2
Yes
No
Fiber optic
No
No
No
No
No
9305-CDSKC
Female
0
No
No
8
Yes
Yes
Fiber optic
No
No
Yes
No
Yes
1452-KIOVK
Male
0
No
Yes
22
Yes
Yes
Fiber optic
No
Yes
No
No
Yes
6713-OKOMC
Female
0
No
No
10
No
No phone service
DSL
Yes
No
No
No
No
7892-POOKP
Female
0
Yes
No
28
Yes
Yes
Fiber optic
No
No
Yes
Yes
Yes
6388-TABGU
Male
0
No
Yes
62
Yes
No
DSL
Yes
Yes
No
No
No
9763-GRSKD
Male
0
Yes
Yes
13
Yes
No
DSL
Yes
No
No
No
No
7469-LKBCI
Male
0
No
No
16
Yes
No
No
No internet service
No internet service
No internet service
No internet service
No internet service
8091-TTVAX
Male
0
Yes
No
58
Yes
Yes
Fiber optic
No
No
Yes
No
Yes
0280-XJGEX
Male
0
No
No
49
Yes
Yes
Fiber optic
No
Yes
Yes
No
Yes
5129-JLPIS
Male
0
No
No
25
Yes
No
Fiber optic
Yes
No
Yes
Yes
Yes
3655-SNQYZ
Female
0
Yes
Yes
69
Yes
Yes
Fiber optic
Yes
Yes
Yes
Yes
Yes
8191-XWSZG
Female
0
No
No
52
Yes
No
No
No internet service
No internet service
No internet service
No internet service
No internet service
7,043 rows
Command took 1.06 seconds
#Create a dataframe containing 'customerID' and 'MonthlyCharges' df_f = df.select(['customerID', 'MonthlyCharges']) display(df_f)
Table
To pick up a draggable item, press the space bar.
While dragging, use the arrow keys to move the item.
Press space again to drop the item in its new position, or press escape to cancel.
customerID
MonthlyCharges
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
7590-VHVEG
29.85
5575-GNVDE
56.95
3668-QPYBK
53.85
7795-CFOCW
42.3
9237-HQITU
70.7
9305-CDSKC
99.65
1452-KIOVK
89.1
6713-OKOMC
29.75
7892-POOKP
104.8
6388-TABGU
56.15
9763-GRSKD
49.95
7469-LKBCI
18.95
8091-TTVAX
100.35
0280-XJGEX
103.7
5129-JLPIS
105.5
3655-SNQYZ
113.25
8191-XWSZG
20.65
7,043 rows
Command took 0.58 seconds
# What are the minimum and maximum values of tenure? Show these as single variables. (2 points)
import pyspark.sql.functions as f
df_min_max = df.agg(f.min(col('Tenure')).alias('Min Tenure'),f.max(col('Tenure')).alias('Max Tenure'))
display(df_min_max)Table
To pick up a draggable item, press the space bar.
While dragging, use the arrow keys to move the item.
Press space again to drop the item in its new position, or press escape to cancel.
Min Tenure
Max Tenure
1
0
72
1 row
Command took 0.89 seconds
# What are the monthly charges depending on the gender of the customer?
monthly_charges_by_gender =df.groupBy(df['gender']).agg({'MonthlyCharges': 'avg'})
display(monthly_charges_by_gender)Table
To pick up a draggable item, press the space bar.
While dragging, use the arrow keys to move the item.
Press space again to drop the item in its new position, or press escape to cancel.
gender
avg(MonthlyCharges)
1
2
Female
65.20424311926602
Male
64.32748241912773
2 rows
Command took 1.28 seconds
# Do customers who have a yearly contract pay on average more than the others? Return the result as a single bool variable.
# Please note that others in this context means that we treat together both the Two year and month-to-month customers. (2 points)
from pyspark.sql.functions import col
yearly_avg = df.filter(col('contract') == 'One year').agg({'MonthlyCharges': 'avg'}).collect()[0][0]
others_avg = df.filter((col('contract') == 'Two year') | (col('contract') == 'Month-to-month')).agg({'MonthlyCharges': 'avg'}).collect()[0][0]
yearly_pays_more = yearly_avg > others_avg
print(yearly_pays_more)True
Command took 1.41 seconds
#Create a column that will show the average charges computed as a division of Total Charges by tenure
df_e=df.withColumn('Average charges',f.round(col('TotalCharges')/col('Tenure'),2))
display(df_e)Table
To pick up a draggable item, press the space bar.
While dragging, use the arrow keys to move the item.
Press space again to drop the item in its new position, or press escape to cancel.
customerID
gender
SeniorCitizen
Partner
Dependents
tenure
PhoneService
MultipleLines
InternetService
OnlineSecurity
OnlineBackup
DeviceProtection
TechSupport
StreamingTV
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
7590-VHVEG
Female
0
Yes
No
1
No
No phone service
DSL
No
Yes
No
No
No
5575-GNVDE
Male
0
No
No
34
Yes
No
DSL
Yes
No
Yes
No
No
3668-QPYBK
Male
0
No
No
2
Yes
No
DSL
Yes
Yes
No
No
No
7795-CFOCW
Male
0
No
No
45
No
No phone service
DSL
Yes
No
Yes
Yes
No
9237-HQITU
Female
0
No
No
2
Yes
No
Fiber optic
No
No
No
No
No
9305-CDSKC
Female
0
No
No
8
Yes
Yes
Fiber optic
No
No
Yes
No
Yes
1452-KIOVK
Male
0
No
Yes
22
Yes
Yes
Fiber optic
No
Yes
No
No
Yes
6713-OKOMC
Female
0
No
No
10
No
No phone service
DSL
Yes
No
No
No
No
7892-POOKP
Female
0
Yes
No
28
Yes
Yes
Fiber optic
No
No
Yes
Yes
Yes
6388-TABGU
Male
0
No
Yes
62
Yes
No
DSL
Yes
Yes
No
No
No
9763-GRSKD
Male
0
Yes
Yes
13
Yes
No
DSL
Yes
No
No
No
No
7469-LKBCI
Male
0
No
No
16
Yes
No
No
No internet service
No internet service
No internet service
No internet service
No internet service
8091-TTVAX
Male
0
Yes
No
58
Yes
Yes
Fiber optic
No
No
Yes
No
Yes
0280-XJGEX
Male
0
No
No
49
Yes
Yes
Fiber optic
No
Yes
Yes
No
Yes
5129-JLPIS
Male
0
No
No
25
Yes
No
Fiber optic
Yes
No
Yes
Yes
Yes
3655-SNQYZ
Female
0
Yes
Yes
69
Yes
Yes
Fiber optic
Yes
Yes
Yes
Yes
Yes
8191-XWSZG
Female
0
No
No
52
Yes
No
No
No internet service
No internet service
No internet service
No internet service
No internet service
7,043 rows
Command took 1.40 seconds